[57] ABSTRACT

A multinode parallel-processing computer is made up of
a plurality of interconnected, large-capacity nodes, each
including a reconfigurable pipeline of functional units
such as Integer Arithmetic Logic Processors, Floating
Point Arithmetic Processors, Special Purpose Processors,
etc. The reconfigurable pipeline of each node is
connected to a multiplane memory by a Memory-ALU
Switch NETwork (MASNET). The reconfigurable
pipeline includes three (3) basic substructures formed
from functional units which have been found to be
sufficient to perform the bulk of all calculations. The
MASNET controls the flow of signals from the memory
planes to the reconfigurable pipeline and vice versa.
The nodes are connectable together by an internode data
router (hyperspace router) so as to form a hypercube
configuration. The capability of the nodes to conditionally
configure the pipeline at each tick of the clock,
without requiring a pipeline flush, permits many powerful
algorithms to be implemented directly.

24 Claims, 9 Drawing Sheets 
MULTINODE RECONFIGURABLE PIPELINE
COMPUTER 

GOVERNMENT RIGHTS

This invention was made with Government support
under contract NAG-1-494 awarded by NASA. The
Government has certain rights in this invention.

BACKGROUND OF THE INVENTION

1. Field of the Invention 

The invention relates to a computer formed of many
nodes in which each of the nodes includes a reconfigurable,
many-function ALU pipeline connected to multiple,
independent memory planes through a multi-function
memory-ALU network switch (MASNET), and
the multiple nodes are connected in a hypercube topology.

2. Description of Related Art 

The computer of the present invention is both a parallel
and a pipelined machine. The prior art does disclose
in certain limited contexts the concept of parallelism
and pipelining. See, for example, U.S. Pat. No.
4,589,067. However, the internal architecture of the
present invention is unique in that it allows most, if
not all, of the computer building blocks to be
simultaneously active. U.S. Pat. No. 4,589,067 is typical of the
prior art in that it describes a vector processor based
upon a dynamically reconfigurable ALU pipeline. This
processor is similar to a single functional unit of the
present invention’s reconfigurable pipeline. In one sense
the pipeline of the present invention’s node is thus a
pipeline of pipelines. Other structures that possibly
merit comparison with the present invention are the
Systolic Array by Kung, the MIT Data-Flow Concept
and the concept of other parallel architectures.

The Systolic Array concept by H. T. Kung of Carnegie
Mellon University involves data which is “pumped”
(i.e. flows) through the computer as “waves”. Unlike
the present invention, the Systolic Array system is
comprised of homogeneous building blocks where each
building block performs a given operation. In the
Systolic Array computer, as data flows through, the
interconnection between identical building blocks remains
fixed during a computation. At best, the configuration
cannot be changed until all data is processed by the
Systolic Array. In the present invention, by contrast,
the interconnection between building blocks can be
changed at any time, even when data is passing through
the pipeline (i.e. dynamic reconfiguration of
interconnects). The present invention is also distinct from the
Systolic Array concept in that each building block (i.e.
functional unit) of the node pipeline of the present
invention can perform a different operation from its
neighbors (e.g. functional unit 1 - floating point multiply;
functional unit 2 - integer minus; functional unit 3 -
logical compare, etc.). In addition, during the course of
computation, each building block of the present invention
can assume different functionalities (i.e. reconfiguration
of functionality).

The MIT Data-Flow computer is comprised of a
network of hardware-invoked instructions that may be
connected in a pipeline arrangement. The instruction
processing is asynchronous to the “data-flow”. Each
data word is appended with a field of token bits which
determines the routing of the data to the appropriate
data instruction units. Each instruction unit has a data
queue for each operand input. The instruction does not
“fire” (i.e. execute) until all operands are present. The
present invention includes the concept of data flowing
through a pipeline network of hardware functional units
that perform operations on data (e.g. act as instructions
that process data). However, by contrast, the present
invention does not function in an asynchronous mode.
Instead, data is fetched from memory and is routed by a
switch (MASNET) to pipelined instruction units
through the centralized control of a very high speed
microsequencing unit. This synchronous control sequence
is in sharp contrast to the asynchronous distributed
data routing invoked by the Data-Flow architecture.

Moreover, the present invention, unlike the Data-Flow
Machine, has no token field (i.e. a data field that
guides the data to the appropriate functional unit) nor do
the functional units have queues (i.e. buffers that hold
operands, instructions, or results). The Data-Flow
Machine has functional units waiting for data. The present
invention has functional units that are continuously
active. The control of the pipeline of the present invention
is achieved by a central controller, referred to as a
microsequencer, whereas the Data-Flow Machine uses
distributed control. The present invention also has the
ability to reconfigure itself based upon internal flow of
data using the TAG field, a feature not found in the
Data-Flow Machine. Furthermore, the Data-Flow computer
does not effectively perform series of like or dissimilar
computations on continuous streams of vector data (i.e.
a single functional operation on all data flowing
through the pipeline). In contrast, the present invention
performs this operation quite naturally.

There are two other principal differences between 
the parallel architecture of the present invention and 
other parallel architectures. First, each node of the 
present invention involves a unique memory/processor 
design (structure). Other parallel architectures involve 
existing stand-alone computer architectures augmented 
for interconnection with neighboring nodes. Second, 
other general multiple-processors/parallel computers 
use a central processing unit to oversee and control 
interprocessor communications so that local processing 
is suspended during global communications. The nodes 
of the present invention, by contrast, use an interproces- 
sor router and cache memory which allows for commu- 
nications without disturbing local processing of data. 

The following U.S. Patents discuss programmable or
reconfigurable pipeline processors: 3,787,673; 3,875,391;
3,990,732; 3,978,452; 4,161,036; 4,225,920; 4,228,497;
4,307,447; 4,454,489; 4,467,409; and 4,482,953. A useful
discussion of the history of both programmable and
non-programmable pipeline processors is found in
columns 1 through 4 of U.S. Pat. No. 4,594,655. In addition,
another relevant discussion of the early efforts to
microprogram pipeline computers is found in the article
entitled PROGRAMMING OF PIPELINED PROCESSORS
by Peter M. Kogge from the March 1977
edition of COMPUTER ARCHITECTURE, pages
63-69.

Lastly, the following U.S. Patents are cited for their 
general discussion of pipelined processors: 4,051,551; 
4,101,960; 4,174,514; 4,244,019; 4,270,181; 4,363,094; 
4,438,494; 4,442,498; 4,454,578; 4,491,020; 4,498,134 and 
4,507,728. 
SUMMARY OF THE INVENTION

Briefly described, the present invention uses a small
number (e.g. 128) of powerful nodes operating
concurrently. The individual nodes need not be, but could be,
synchronized. By limiting the number of nodes, the
total communications and related hardware and software
that is required to solve any given problem is kept
to a manageable level, while at the same time using to
advantage the gain in speed and capacity that is inherent
with concurrency. In addition, the interprocessor
communications between nodes of the present invention
that do occur do not interrupt the local processing of
data within the node. These features provide for a very
efficient means of processing large amounts of data
rapidly. Each node of the present invention is comparable
in speed and performance to Class VI supercomputers
(e.g. Cray 2, Cyber 205, etc.). Within a given
node the computer uses many (e.g. 30) functional units
(e.g. floating point arithmetic processors, integer
arithmetic/logic processors, special-purpose processors,
etc.) organized in a synchronous, dynamically-reconfigurable
pipeline such that most, if not all, of the functional
units are active during each clock cycle of a given
node. This architectural design serves to minimize the
storage of intermediate results in memory and assures
that the sustained speed of a typical calculation is close to
the peak speed of the machine. This, for example, is not
the case with existing Class VI supercomputers where
the actual sustained speed for a given computation is
much less than the peak speed of the machine. In addition,
the invention further provides for flexible and
general interconnection between the multiple planes of
memory, the dynamically reconfigurable pipeline, and
the interprocessor data routers.

Each node of the present invention includes a recon- 
figurable arithmetic/logic unit (ALU), a multiplane 
memory and a memory-ALU network (MASNET) 
switch for routing data between the memory planes and 
the reconfigurable ALU. Each node also includes a 
microsequencer and a microcontroller for directing the 
timing and nature of the computations within each 
node. Communication between nodes is controlled by a 
plurality of hyperspace routers. A front end computer 
associated with significant off-line mass storage pro- 
vides the input instructions to the multi-node computer. 
The preferred connection topology of the node is that 
of a boolean hypercube. 

The reconfigurable ALU pipeline within each node
preferably comprises pipeline processing elements
including floating-point processors, integer/logic processors
and special-purpose elements. The processing elements
are wired into substructures that are known to
appear frequently in many user applications. Three
hardwired substructures appear frequently within the
reconfigurable ALU pipeline. One substructure comprises
a two-element unit, another comprises a three-element
unit and the last substructure comprises a one-element
unit. The three-element substructure is found typically
twice as frequently as the two-element substructure
and the two-element substructure is found typically
twice as frequently as the one-element substructure. The
judicious use of those substructures helps to reduce the
complexity of the switching network employed to control
the configuration of the ALU pipeline.

The invention will be further understood by refer- 
ence to the following drawings. 
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an embodiment of the multinode
computer arranged in a two-dimensional nearest-neighbor
grid which is a subset of the boolean hypercube.

FIG. 2 is a schematic diagram of an individual node
illustrating the memory/MASNET/ALU circuit
interconnections.

FIG. 3 is a schematic diagram illustrating the layout 
of one memory plane within a single node such as illus- 
trated in FIG. 2. 

FIG. 4 illustrates two typical substructures formed 
from five arithmetic/logic units as might be found 
within the reconfigurable ALU pipeline of each node. 

FIG. 5A illustrates a typical ALU pipeline organiza- 
tion and the switching network (FLONET) which al- 
lows for a change in configuration of the substructures. 

FIG. 5B illustrates a preferred embodiment of the 
interconnection of a FLONET to a grouping of the 
three common substructures in a reconfigurable ALU 
pipeline. 

FIG. 6 is a schematic diagram of a 32-register x n-bit
memory/ALU network switch (MASNET) and internode
communications unit where the blocks represent
six-port register files.

FIG. 7 is a schematic diagram of a 2x2 MASNET
which illustrates how the input data stream can source
two output data streams with a relative shift of “p”
elements.

FIG. 8 is a schematic diagram of an 8-node hyper- 
cube showing the relationship of the hyperspace routers 
to the MASNET units of each node. 

DETAILED DESCRIPTION OF THE 
INVENTION 

During the course of this description, like numbers 
will be used to identify like elements according to the 
different figures which illustrate the invention. 

The computer 10 according to the preferred embodiment
of the invention illustrated in FIG. 1 includes a
plurality of multiple memory/computational units
referred to as nodes 12. Computer 10 is of the
parallel-processing variety capable of performing arithmetic
and logical operations with high vector and scalar
efficiency and speed. Such a device is capable of solving a
wide range of computational problems. Each node 12 is
connected via drop-line network 18 to a front-end
computer 16 that provides a host environment suitable for
multi-user program development, multinode initialization
and operation, and off-line data manipulation.
Front-end computer 16 is connected to an off-line mass
storage unit 20 by interconnection 22. Each node 12 is
also connected to adjacent nodes by internode
connections 14. For purposes of clarity and illustration, only 25
nodes 12 are illustrated with simple internode links 14 in
FIG. 1. However, it will be appreciated that the nodes
12 can be connected in a general hypercube configuration
and that the invention may comprise fewer or more
than 128 nodes as the application requires. Rather than
interconnect a large number of relatively slow
microprocessors, as is done with other prior art parallel
computers, the present invention incorporates a relatively
small number of interconnected, large-capacity,
high-speed powerful nodes 12. According to the preferred
embodiment of the present invention, the configuration
typically consists of between 1 and 128 nodes 12. This
approach limits the number of physical and logical
interconnects 14 between nodes 12. The preferred
connection topology is that of a boolean hypercube. Each
of the nodes 12 of the computer 10 is comparable to a
Class VI supercomputer in processing speed and capacity.
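
By way of illustration only, the boolean hypercube connection of
nodes 12 can be sketched in Python. The sketch assumes the usual
convention that nodes are numbered 0 to 2^d - 1 and that two nodes
are linked when their binary indices differ in exactly one bit; the
function names are hypothetical and do not appear in this
description.

```python
from typing import List


def hypercube_neighbors(node: int, d: int) -> List[int]:
    """Return the d neighbors of `node` in a d-dimensional boolean hypercube.

    Assumption: nodes are numbered 0 .. 2**d - 1 and are linked when their
    binary indices differ in exactly one bit.
    """
    if not 0 <= node < (1 << d):
        raise ValueError("node index out of range for this dimension")
    # Flipping bit k of the index gives the neighbor along dimension k.
    return [node ^ (1 << k) for k in range(d)]


def hop_count(a: int, b: int) -> int:
    """Minimum number of internode links between nodes a and b."""
    # The Hamming distance of the two indices is the shortest path length.
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    print(hypercube_neighbors(0, 3))  # neighbors of node 0 in an 8-node cube
    print(hop_count(0, 7))            # farthest pair in an 8-node cube
```

Under this numbering a 128-node computer (d = 7) needs only seven
internode links per node, and no two nodes are more than seven hops
apart.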

The details of a typical individual node 12 are
illustrated in FIG. 2. Each node 12, which is the building
block of the computer 10, is comprised of five (5) basic
elements, namely: (1) a reconfigurable ALU pipeline 24
having many (e.g. 9 or more) high-performance and
special-purpose elements 62, (2) a group 28 of independent
memory planes 30, (3) a non-blocking multiple-input
and multiple-output switching MASNET
(Memory/ALU Switch NETwork) 26, (4) a
microsequencer 40 and (5) a microcontroller 42. FIG. 2
illustrates such a node 12 which includes 8 memory
planes 30 connected to a reconfigurable pipeline 24 by
memory-ALU network switch (MASNET) 26. As used
in this description the terms “processing elements”,
“functional unit”, “programmable processors” and
“building blocks” refer to arithmetic/logic units 62
which comprise either floating point arithmetic processors,
integer arithmetic/logic processors, special-purpose
processors or a combination of the foregoing.

Microsequencer 40 is connected via lines 46 to memory
28, MASNET 26 and reconfigurable ALU pipeline
24 respectively. Similarly, microcontroller 42 is connected
to the same elements via lines 44. Microsequencer
40 governs the clocking of data between
and within the various elements and serves to define
data pathways and the configuration of pipeline 24 for
each clock tick of the node 12. In a typical operation, a
new set of operands is presented to the pipeline 24 and
a new set of results is derived from the pipeline 24 on
every clock of the node 12. Microsequencer 40 is
responsible for the selection of the microcode that defines
the configuration of pipeline 24, MASNET 26 and
memory planes 30. In typical operation, the addresses
increase sequentially in each clock period from a
specific start address until a specified end address is
reached. The address ramp is repeated continually until
an end-of-computation interrupt flag is issued. The
actual memory address used by a given plane 30 of memory
28 may differ from the microsequencer 40 address
depending upon the addressing mode selected. (See
discussion concerning memory planes below).
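
As a minimal sketch only, the repeating address ramp described
above can be modeled in Python as a generator that yields one
address per clock. The name address_ramp is hypothetical, and the
end-of-computation interrupt is modeled simply by the caller ceasing
to draw addresses.

```python
from itertools import islice
from typing import Iterator


def address_ramp(start: int, end: int) -> Iterator[int]:
    """Yield one microsequencer address per clock tick.

    Addresses increase sequentially from `start` to `end` (inclusive) and
    the ramp then repeats. The generator never stops on its own; the caller
    stops consuming it when the end-of-computation flag is raised.
    """
    if end < start:
        raise ValueError("end address must not precede start address")
    while True:
        for addr in range(start, end + 1):
            yield addr


if __name__ == "__main__":
    # Seven clock ticks over a three-address ramp.
    print(list(islice(address_ramp(4, 6), 7)))  # [4, 5, 6, 4, 5, 6, 4]
```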

Microcontroller 42, also referred to as a node manager,
is used to initialize and provide verification of the
various parts of the node 12. For a given computation,
after the initial set up, control is passed to the
microsequencer 40 which takes over until the computation
is complete. In principle, microcontroller 42 does not
need to be active during the time that computations are
being performed, although in a typical operation the
microcontroller 42 would be monitoring the progress of
the computation and preparing unused parts of the
computer for the next computation.

In addition to the five basic elements which constitute
a minimal node 12, each node 12 may be expanded to
include local mass storage units, graphic processors,
pre- and post-processors, auxiliary data routers, and the
like. Each node 12 is operable in a stand-alone mode
because the node manager 42 is a stand-alone
microcomputer. However, in the normal case the node 12 would
be programmed from the front-end computer 16.

The layout of a single memory plane 30 is schematically
illustrated in FIG. 3. Memory planes 30 are of high
capacity and are capable of sourcing (reading) or sinking
(writing) a data word in a clock of the machine 10.
Each memory plane 30 can be enabled for read-only,
write-only or read/write operations. The memory
planes 30 support three possible addressing modes,
namely: (1) direct, (2) translate and (3) computed. With
all three modes, the working address is prefetched by
prefetch address register 52 on the previous cycle of the
computer 10. In the direct mode, the address from the
microsequencer address bus 46 is used to select the
memory element of interest. In the translate mode, the
microsequencer address is used to look up the actual
address in a large memory table of addresses. This large
table of addresses is stored in a separate memory unit
referred to as the translate memory bank or table 50.
The translate table 50 can be used to generate an
arbitrary scan pattern through main memory bank 54. It can
also be used to protect certain designated memory
elements from ever being over-written. The computed
address mode allows the pipeline 24 to define the
address of the next sourced or sinked data word.
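
The three addressing modes may be summarized, as a sketch under
stated assumptions, by the following Python fragment. The names
Mode and working_address are hypothetical, the translate table 50
is modeled as an ordinary list, and the one-cycle prefetch by
register 52 is not modeled.

```python
from enum import Enum
from typing import Optional, Sequence


class Mode(Enum):
    DIRECT = "direct"
    TRANSLATE = "translate"
    COMPUTED = "computed"


def working_address(
    mode: Mode,
    seq_addr: int,
    translate_table: Optional[Sequence[int]] = None,
    pipeline_addr: Optional[int] = None,
) -> int:
    """Select the address a memory plane uses on the next cycle."""
    if mode is Mode.DIRECT:
        # The microsequencer address is used unchanged.
        return seq_addr
    if mode is Mode.TRANSLATE:
        if translate_table is None:
            raise ValueError("translate mode needs a translate table")
        # The microsequencer address indexes a table of actual addresses.
        return translate_table[seq_addr]
    if mode is Mode.COMPUTED:
        if pipeline_addr is None:
            raise ValueError("computed mode needs an address from the pipeline")
        # The pipeline itself supplies the next address.
        return pipeline_addr
    raise ValueError("unknown addressing mode")


if __name__ == "__main__":
    scan = [3, 1, 2, 0]  # an arbitrary scan pattern through main memory
    print([working_address(Mode.TRANSLATE, a, translate_table=scan) for a in range(4)])
```

In this model a table that never lists a given address also
illustrates the write-protection use mentioned above, since that
element can never be selected.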

Reconfigurable pipeline 24 is formed of various processing
elements shown as units 62 in FIG. 4 and a
switch network shown as FLONET 70 in FIGS. 5A
and 5B (FLONET is an abbreviation for Functional and
Logical Organization NETwork). Three (3) permanently
hardwired substructures or units 62, 64 or 66 are
connected to FLONET. FLONET 70 reconfigures the
wiring of the pipelined substructures 62, 64 and 66
illustrated collectively as 68 in FIG. 5A and 69 in FIG. 5B.
The specialized reconfigurable interconnection is
achieved by electronic switches so that new configurations
can be defined within a clock period of the node
12. An example of high-level data processing in a specific
situation is shown in FIG. 4. The pipeline processing
elements include floating-point arithmetic processors
(e.g. AMD 29325, Weitek 1032/1033), integer
arithmetic/logic units 62 (e.g. AMD 29332) and
special-purpose elements such as vector regeneration units
and convergence checkers. A useful discussion related
to the foregoing special-purpose elements can be found
in an article entitled “Two-Dimensional, Non Steady
Viscous Flow Simulation on the Navier Stokes Computer
MiniNode”, J. Sci. Comput., Vol. 1, No. 1 (1986)
by D. M. Nosenchuck, M. G. Littman and W. Flannery.
Processing elements 62 are wired together in three (3)
distinct substructures 62, 64 and 66 that have been found
to appear frequently in many user application programs.
Two of the most commonly used substructures 64 and
66 are shown by the elements enclosed in dotted lines in
FIG. 4. Substructure 64 comprises three ALU units 62
having four inputs and one output. Two ALU units 62
accept the four inputs in two pairs of twos. The outputs
of the two ALU units 62 form the two inputs to the
third ALU unit 62. Each of the three ALU units 62 is
capable of performing floating point and integer addition,
subtraction, multiplication and division, logical
AND, OR, NOT, exclusive OR, mask, shift, and compare
functions with a logical register file used to store
constants. Substructure 66 comprises two arithmetic/logic
units 62 and is adapted to provide three inputs and
one output. One of the two arithmetic/logic units 62
accepts two inputs and produces one output that forms
one input to the second arithmetic/logic unit 62. The
other input to the second arithmetic/logic unit 62
comes directly from the outside. The single output of
substructure 66 comes from the second arithmetic/logic
unit 62. Accordingly, substructure 66 comprises a three
input and one output device. The third and last most
common substructure is an individual arithmetic/logic
unit 62 standing alone, i.e. two inputs and one output.
Substructures 62, 64 and 66 are permanently hardwired
into those respective configurations; however, the
reconfiguration among those units is controlled by FLONET
70. A simplified FLONET 70 is schematically
represented in FIG. 5A. For simplicity, two three-element
substructures 64, two two-element substructures
66 and two one-element substructures 62 are illustrated.
This results in a twelve-functional-unit, high-level
reconfigurable pipeline 24.
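
The data paths of the three hardwired substructures can be
sketched, for illustration only, as compositions of two-input
operations in Python. The function names are hypothetical, each
functional unit 62 is modeled as a plain two-argument function, and
the sketch assumes that in substructure 66 the internally generated
value feeds the first input of the second unit (the description does
not say which input it occupies). Pipelining and timing are not
modeled.

```python
import operator
from typing import Callable

# A functional unit 62 is modeled as a two-input, one-output operation.
Op = Callable[[float, float], float]


def single(op: Op) -> Callable[[float, float], float]:
    """One-element substructure 62: two inputs, one output."""
    return lambda a, b: op(a, b)


def two_element(op1: Op, op2: Op) -> Callable[[float, float, float], float]:
    """Two-element substructure 66: three inputs, one output.

    The first unit combines a and b; the second unit combines that result
    with c, which arrives directly from outside the substructure.
    """
    return lambda a, b, c: op2(op1(a, b), c)


def three_element(op1: Op, op2: Op, op3: Op) -> Callable[[float, float, float, float], float]:
    """Three-element substructure 64: four inputs, one output.

    Two units take the inputs in pairs; the third combines their outputs.
    """
    return lambda a, b, c, d: op3(op1(a, b), op2(c, d))


if __name__ == "__main__":
    # (a*b) + (c*d): a sum of two products in one three-element substructure.
    sum_of_products = three_element(operator.mul, operator.mul, operator.add)
    print(sum_of_products(1.0, 2.0, 3.0, 4.0))  # 14.0
```

Changing the operations passed to a substructure corresponds to
reconfiguring functionality, while the wiring between the units
inside each substructure stays fixed.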

FIG. 5B illustrates an optimal layout of a FLONET/ALU
interconnect. According to the preferred embodiment
of the invention 10, the optimal ratio between
the three-element substructures 64 and the two-element
substructures 66 is in the range of 1.5 to 2.0 to 1.0
(1.5-2.0:1). Likewise the optimal ratio between the
two-element substructures 66 and the single-element
substructures 62 is approximately 2 to 1 (2:1). Accordingly,
FIG. 5B illustrates the optimal scenario which includes
eight three-element substructures 64, four two-element
substructures 66 and two single-element substructures
62. The number of three-element substructures 64 could
vary between 6 and 8 according to the embodiment
illustrated in FIG. 5B. The preferred ratios just described
are approximate and might vary slightly from
application to application. However, it has been found
that the foregoing ratios do provide very close to optimal
results.
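
The arithmetic behind these groupings can be checked with a short
Python sketch. The helper names are hypothetical; the counts are
those given for FIGS. 5A and 5B.

```python
from typing import Tuple


def functional_units(three_elem: int, two_elem: int, one_elem: int) -> int:
    """Total functional units 62 in a grouping of substructures 64, 66 and 62."""
    return 3 * three_elem + 2 * two_elem + 1 * one_elem


def substructure_ratios(three_elem: int, two_elem: int, one_elem: int) -> Tuple[float, float]:
    """Return (three-element : two-element, two-element : one-element) ratios."""
    return three_elem / two_elem, two_elem / one_elem


if __name__ == "__main__":
    print(functional_units(2, 2, 2))     # simplified grouping of FIG. 5A
    print(functional_units(8, 4, 2))     # preferred grouping of FIG. 5B
    print(substructure_ratios(8, 4, 2))  # upper end of the 1.5-2.0 range
    print(substructure_ratios(6, 4, 2))  # lower end of the 1.5-2.0 range
```

The simplified grouping gives the twelve functional units stated
for FIG. 5A, and the eight/four/two grouping of FIG. 5B gives 34
functional units by the same count.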

According to the preferred embodiment of the invention
the grouping 69 of substructures 62, 64 and 66 in
FIG. 5B has the functional units, or building blocks,
62 organized in the following manner: each of the three
functional units 62 (i.e. programmable processors) in the
eight substructures 64 would be floating point processors
like the AMD 29325; two of substructures 66
would have each of their two functional units 62 in the
form of floating point processors like the AMD 29325
whereas the remaining two substructures 66 would
have integer/logic processors like the AMD 29332;
lastly one of the remaining single functional units 62
would be a floating point processor like the AMD
29325 and the other remaining single functional unit 62
would be an integer logic processor like the AMD
29332. Alternatively, it is also possible to pair processors
to form a hybrid functional unit 62. For example, a
floating point processor like the AMD 29325 could be
paired in a manner known to those of ordinary skill in
the art with an integer logic processor like the AMD
29332 so that the functional unit 62 can alternate between
floating point and integer/logic. It is also possible
to use a single many-function processor (floating point
arithmetic, integer arithmetic/logic) like the Weitek
3332 to achieve the same result.

The details of a MASNET 26 (Memory ALU Switch
NETwork) are shown with sixteen inputs and
sixteen outputs in FIG. 6. MASNET 26 is made up of
register files 72 (e.g. Weitek 1066) that are cross-connected
in a Benes switching network arrangement and
pipelined so as to make the connection of any input to
any output non-blocking. The MASNET 26 illustrated
in FIG. 6 is a sixteen-by-sixteen (16x16) circuit. The
fact that each register file 72 has local memory also
means that by using the MASNET 26 it is possible to
reorder data as it flows through the network. This feature
can be used, for example, to create two data
streams from a common source in which one is delayed
with respect to the other by several elements. The formation
of multiple data streams from a common source
is also a feature of MASNET 26. FIG. 7 illustrates more
explicitly how a 2x2 MASNET (i.e. a single register
file 72) can achieve both of these simple tasks.
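
A minimal behavioral sketch of these two tasks, written in Python
under stated assumptions, is given below. The name split_with_shift
is hypothetical, the local memory of register file 72 is modeled as
a simple queue, and None stands for the clocks on which the delayed
stream has not yet produced a valid word.

```python
from collections import deque
from typing import Iterable, Iterator, Optional, Tuple


def split_with_shift(stream: Iterable[int], p: int) -> Iterator[Tuple[int, Optional[int]]]:
    """Source two output streams from one input stream.

    On each clock the first output carries the current word and the second
    carries the word that arrived p clocks earlier (None until p words have
    been buffered).
    """
    if p < 0:
        raise ValueError("shift must be non-negative")
    buffer: deque = deque()  # models the register file's local memory
    for word in stream:
        buffer.append(word)
        delayed = buffer.popleft() if len(buffer) > p else None
        yield word, delayed


if __name__ == "__main__":
    for current, delayed in split_with_shift([10, 11, 12, 13], p=2):
        print(current, delayed)
```

Pairing element i with element i - p in this way is the kind of
access pattern a pipeline needs when one operand stream is a shifted
copy of another.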

MASNET 26 is used also for internode communications
in that it routes data words corresponding to the
nodal boundaries to bordering nodes 12 through
hyperspace routers 80. This routing is achieved as the data
flows through the MASNET 26 without the introduction
of any additional delays. Likewise, the hyperspace
router 80 of a given node 12 can inject needed boundary
point values into the data stream as they are needed
without the introduction of any delays. A more detailed
discussion of internode communications follows.

The global topology of the multinode computer 10 is
that of a hypercube. The hypercube represents a
compromise between the time required for arbitrary
internode communications, and the number of physical
interconnections between nodes 12. Two addressing
modes support internode data communications, namely:
(1) global addressing and (2) explicit boundary-point
definition, or BPD. Global addressing is simply extended
addressing, where an address specifies the
node/memory-plane/offset of the data. From a software
standpoint, the address is treated as a simple linear
address whose range extends across all nodes in the
computer 10. Internode communications is handled by
software and is entirely transparent to the programmer
if default arbitration and communications-lock parameters
are chosen. BPD involves the explicit definition of
boundary points, their source, and all destination
addresses. Whenever BPD data is generated, it is immediately
routed to BPD caches 82 in the destination nodes
12 as illustrated in FIG. 8. Local addressing and BPD
may be intermixed. The main advantage of global
addressing over BPD is software simplicity, although
BPD has the capability of eliminating most internode
communications overhead by precommunicating
boundary-point data before they are requested by other
nodes.
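
Global addressing can be pictured, as a sketch only, as a linear
address that is split into node, memory-plane and offset fields. The
field widths below are assumptions chosen for illustration (7 bits
for up to 128 nodes, 3 bits for the 8 memory planes of FIG. 2, and
a hypothetical 20-bit offset); the description does not specify an
address layout, and the function names are likewise hypothetical.

```python
from typing import Tuple

NODE_BITS = 7     # assumption: up to 128 nodes
PLANE_BITS = 3    # assumption: 8 memory planes per node, as in FIG. 2
OFFSET_BITS = 20  # hypothetical offset width, for illustration only


def join_global_address(node: int, plane: int, offset: int) -> int:
    """Pack node/memory-plane/offset into one linear global address."""
    if not 0 <= node < (1 << NODE_BITS):
        raise ValueError("node out of range")
    if not 0 <= plane < (1 << PLANE_BITS):
        raise ValueError("plane out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset out of range")
    return (node << (PLANE_BITS + OFFSET_BITS)) | (plane << OFFSET_BITS) | offset


def split_global_address(addr: int) -> Tuple[int, int, int]:
    """Unpack a linear global address into (node, plane, offset)."""
    if not 0 <= addr < (1 << (NODE_BITS + PLANE_BITS + OFFSET_BITS)):
        raise ValueError("address out of range")
    offset = addr & ((1 << OFFSET_BITS) - 1)
    plane = (addr >> OFFSET_BITS) & ((1 << PLANE_BITS) - 1)
    node = addr >> (PLANE_BITS + OFFSET_BITS)
    return node, plane, offset


if __name__ == "__main__":
    addr = join_global_address(node=5, plane=3, offset=42)
    print(split_global_address(addr))  # (5, 3, 42)
```

With such a layout, software can step through consecutive addresses
without regard to node boundaries, which is the simplicity attributed
to global addressing above.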

Data are physically routed between nodes 12 using
local switching networks attached to each node 12. The
local switching networks, previously referred to as
hyperspace routers 80, are illustrated in FIG. 8. Hyperspace
routers 80 are non-blocking permutation networks
with a topology similar to the Benes network.
For a multinode class computer of order d (i.e.,
$NN = 2^{d}$, where NN is the number of nodes), the hyperspace
router permits d+1 inputs, which include d neighboring
nodes 12 plus one additional input for the host node
12. The data are self-routing in that the destination
address, carried with the data, is used to establish
hyperspace router switch states. An eight-node system is
illustrated in FIG. 8. In this example, d = 3, and each
hyperspace router 80 has a 4x4 network with a delay of
three minor clocks. For 3 < d < 8, where d is an
integer, an 8x8 router 80 is required, with d = 7 providing
complete switch utilization. Since the hyperspace
router 80 must be configured for $\log_2(d+1)$ inputs, optimal
hardware performance is given by a computer array
having the size of

$$NN = 2^{2^{n}-1}, \quad n = 0, 1, 2, 3, \ldots$$

Configurations of 1, 2, 8, 128, ... nodes fully utilize
the hyperspace routers 80. Multinode computer configurations
with non-integer $\log_2(d+1)$ are also supported, except
that the hyperspace router 80 is scaled up to the next
integral dimension. The implications of this are not
severe, in that aside from the penalty of additional
switch hardware, a slightly greater amount of storage is
required for the permutation tables. The node stores
these tables in a high-speed look-up table. The length of
the table is (d+1)!. When the computer grows beyond
128 nodes, the hyperspace router increases to a 16x16
switch. Since the look-up tables become prohibitively
large, the permutation routing is then accomplished by
bit-slice hardware which is somewhat slower than the
look-up tables. These considerations have established
128 nodes as the initial, preferred computer configuration.
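
The sizing considerations above can be tabulated with a short
Python sketch. It takes the router to need d+1 ports rounded up to
the next power of two, the fully utilized computer sizes to be
NN = 2^(2^n - 1) (which reproduces the listed configurations of 1,
2, 8 and 128 nodes), and the permutation table length to be
(d+1)!, the number of permutations of d+1 inputs. The function names
are hypothetical.

```python
from math import factorial
from typing import List


def router_ports(d: int) -> int:
    """Port count of the hyperspace router for a hypercube of order d.

    d neighbor inputs plus one host input are needed, and the switch is
    scaled up to the next power of two.
    """
    if d < 0:
        raise ValueError("order must be non-negative")
    ports = 1
    while ports < d + 1:
        ports *= 2
    return ports


def permutation_table_length(d: int) -> int:
    """Number of permutations of the d+1 router inputs."""
    return factorial(d + 1)


def fully_utilized_node_counts(n_max: int) -> List[int]:
    """Computer sizes NN = 2**(2**n - 1) for n = 0 .. n_max."""
    return [2 ** (2 ** n - 1) for n in range(n_max + 1)]


if __name__ == "__main__":
    print(fully_utilized_node_counts(3))    # [1, 2, 8, 128]
    for d in (3, 5, 7, 8):
        print(d, router_ports(d), permutation_table_length(d))
```

For d = 7 the table holds 40,320 entries, whereas a 16x16 switch
would imply 16! permutations, which illustrates why look-up tables
become impractical beyond 128 nodes.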

Data transmission between nodes 12 occurs over
fiber-optic cables in byte-serial format at a duplex rate
of 1 Gbyte/second. This rate provides approximately
two orders-of-magnitude head room for occasional
burst transmissions and also for future computer expansion.
Each node 12 has a 1 Mword boundary-point
write-through cache which, in the absence of host-node
requests for cache bus cycles, is continuously updated
by the hyperspace router 80. Thus, current boundary
data are maintained physically and logically close to the
ALU pipeline inputs.

While the invention has been described with reference
to the preferred embodiment thereof, it will be
appreciated that various modifications can be made to
the parts and methods that comprise the invention without
departing from the spirit and scope thereof.

We claim: 

1. A multi-node, parallel processing computer appa- 
ratus comprising: 

a plurality of nodes each including an internal memory
and a reconfigurable arithmetic logic (ALU)
pipeline unit and a memory/ALU/switch network
(MASNET) for transferring data from said internal
memory through said MASNET to said reconfigurable
ALU pipeline unit and from said reconfigurable
ALU pipeline unit through said MASNET to
said internal memory, said reconfigurable ALU
pipeline unit further including a first group of programmable
processors permanently connected together
in a first configuration having four (4) inputs
and one (1) output and a second group of programmable
processors permanently connected together
in a second configuration different from said first
configuration, said second group having three (3)
inputs and one (1) output, and an ALU pipeline
configuration switching network means (FLONET)
for selectively connecting said first and second
groups to each other, and sequencer means for
providing instructions to said FLONET once a
clock cycle; and,

router means for routing data between said nodes,
wherein said reconfigurable ALU pipeline unit selectively
performs different computations according
to instructions from said sequencer means once a
clock cycle.

2. A reconfigurable computer apparatus comprising:

a first group of programmable processors permanently
connected together in a first configuration
having four (4) inputs and one (1) output, said first
group including a first programmable processor
having at least two (2) inputs and at least one (1)
output; a second programmable processor having
at least two (2) inputs and at least one (1) output;
and, a third programmable processor having two
(2) inputs permanently connected to the outputs of
said first and second programmable processors,
said third programmable processor also having an
output, such that the four inputs of said first group
comprise the inputs of said first and second programmable
processors and the output of said first
group comprises the output of said third programmable
processor;

a second group of programmable processors permanently
connected together in a second configuration
different from said first configuration, said
second group having three (3) inputs and one (1)
output and including a fourth programmable processor
having two (2) inputs and one (1) output;
and, a fifth programmable processor having two (2)
inputs and one (1) output, one of said inputs of said
fifth programmable processor being permanently
connected to the output of said fourth programmable
processor, such that the three (3) inputs of said
second group comprise the two (2) inputs to said
fourth programmable processor and the input to
said fifth programmable processor not connected
to the output of said fourth programmable processor,
and the output of said second group comprising
the output of said fifth programmable processor;

a third group of programmable processors comprising
individual processors having two (2) inputs and
one (1) output;

switching means (FLONET) for selectively connect- 
ing said first, second and third groups together; 
and, 

sequencer means for providing instructions to said 
FLONET once a clock cycle, 

wherein said apparatus selectively performs different 
computations according to instructions from said 
sequencer means once a clock cycle. 

3. A reconfigurable computer apparatus including
arithmetic/logic units (ALU), said apparatus compris- 
ing: 

at least a first substructure including three (3) ALU 
units permanently connected together in a first 
configuration having four (4) inputs and one (1)
output; 

at least a second substructure including two (2) ALU 
units permanently connected together in a second 
configuration having three (3) inputs and one (1) 
output; 

at least a third substructure including at least one 
individual ALU unit having two (2) inputs and
one (1) output;

switching means for selectively connecting said first, 
second and third substructure together; and, 

sequencer means for providing instructions to said 
switching means, 

wherein said apparatus selectively performs compu- 
tations according to instructions from said se- 
quencer means. 

4. A node apparatus for use in a multi-node, parallel
processing system, said node apparatus comprising: 

an internal memory including a plurality of memory 
planes; 

a dynamically reconfigurable arithmetic logic (ALU) 
pipeline means for performing computations, in- 
cluding a plurality of ALUs at least three of which 
are permanently connected to each other; 

an ALU pipeline configuration switching network
means (FLONET) for selectively connecting
groups of said ALUs in said dynamically reconfigurable
arithmetic logic pipeline means together;

a memory/ALU/switch network (MASNET) for
transferring data from the memory planes of said
internal memory through said MASNET to said
dynamically reconfigurable ALU pipeline means
and from said dynamically reconfigurable ALU
pipeline means through said MASNET to said
internal memory; and,

sequencer means for providing instructions to said
FLONET,

wherein said dynamically reconfigurable ALU pipeline
means selectively performs different computations
according to instructions from said sequencer
means.

5. The apparatus of claim 1 wherein said first group
of programmable processors comprises:

a first programmable processor having at least two
(2) inputs and at least one (1) output;

a second programmable processor having at least two
(2) inputs and at least one (1) output; and,

a third programmable processor having two (2) inputs
permanently connected to the outputs of said
first and said second programmable processors,
said third programmable processor also having an
output,

wherein the inputs to said first group comprise the 
inputs of said first and second programmable pro- 
cessors and the output of said first group comprises 
the output of said third programmable processor. 

6. The apparatus of claim 5 wherein said second
group of programmable processors comprises:

a fourth programmable processor having at least two 
(2) inputs and at least one (1) output; and, 

a fifth programmable processor having two (2) inputs
and one (1) output, one of said inputs of said fifth
programmable processor being permanently connected
to the output of said fourth programmable
processor,

wherein the inputs of said second group comprise the
two inputs to said fourth programmable processor
and the one input to said fifth programmable processor
not connected to the output of said fourth
programmable processor, and the output of said
second group comprises the output of said fifth
programmable processor.

7. The apparatus of claim 6 wherein said reconfigura- 
ble ALU pipeline unit further comprises: 

a third group of programmable processors comprising
individual programmable processors connected
to said FLONET for selective connection with
said first and second groups of programmable
processors.

8. The apparatus of claim 7 wherein the ratio of said
first group of programmable processors with respect to
said second group of programmable processors in a
given reconfigurable ALU pipeline unit is approximately
in the range of 1.5-2.0 to 1.0.

9. The apparatus of claim 8 wherein the ratio of said
second group of programmable processors to said third
group of programmable processors is approximately 2.0
to 1.0.

10. The apparatus of claim 9 wherein said internal 
memory comprises a plurality of memory planes. 

11. The apparatus of claim 10 wherein each memory
plane comprises:

a main memory bank; 

an address multiplexer for transmitting data to and
from said main memory bank;

a prefetch address register connected between said
main memory bank and said address multiplexer;
and,

a translate table means connected to said address 
multiplexer for scanning said assembly bank in a 
random access manner. 

12. The apparatus of claim 11 wherein said sequencer 
means further comprises: 

microsequencer means connected to said internal 
memory, MASNET and reconfigurable ALU pipe- 
line unit for governing the clocking of data be- 
tween said internal memory, MASNET and said 
reconfigurable ALU pipeline unit. 

13. The apparatus of claim 12 wherein each node 
further comprises: 

a microcontroller connected to said internal memory, 
MASNET and said reconfigurable ALU pipeline 
unit for initializing and verifying the status of said 
internal memory, MASNET and reconfigurable 
ALU pipeline. 

14. The apparatus of claim 13 wherein said MAS- 
NET comprises: 

a plurality of register files cross connected in a Benes 
switching network arrangement and pipelined so as 
to make the connection of any input to any output 
non-blocking. 

15. The apparatus of claim 14 further comprising: 

boundary-point definition (BPD) cache means con- 
nected between said router means and said MAS- 
NET for routing BPD data to specific destination 
nodes, 

wherein said apparatus supports both global address- 
ing and explicit BPD addressing modes. 

16. The apparatus of claim 15 further comprising: 

a front end computer for feeding data and instruc- 
tions to said nodes; and, 

off-line mass storage means connectable to said front 
end computer. 

17. The apparatus of claim 16 wherein said nodes are 
connected together in the topology of a boolean hyper- 
cube and vary in number in the range of from 1 to 128. 

18. The apparatus of claim 2 further comprising: 

an internal memory; and, 

a memory-ALU switch network means (MASNET) 
for transferring data from said internal memory 
through said MASNET to said switching means 
and for transferring data from said switching means 
through said MASNET to said internal memory. 

19. The apparatus of claim 18 wherein said sequencer
means further comprises: 

microsequencer means connected to said internal
memory, MASNET and switching means for gov- 
erning the clocking of data between said internal 
memory, MASNET and switching means. 

20. The apparatus of claim 19 further comprising: 

microcontroller means connected to said internal
memory, MASNET and switching means for initializing
and verifying the status of said internal
memory, MASNET and switching means.

21. The apparatus of claim 2 wherein at least some of 
said processors comprise floating point arithmetic pro- 
cessors. 

22. The apparatus of claim 2 wherein at least some of 
said processors comprise integer arithmetic logic pro- 
cessors. 

23. The apparatus of claim 2 wherein the ratio of said
first group of programmable processors with respect to
said second group of programmable processors is approximately
in the range of 1.5-2.0 to 1.0.

24. The apparatus of claim 2 wherein the ratio of said 
second group of programmable processors to said third 
group of programmable processors is approximately 2.0 
to 1.0. 
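
By way of illustration only, the three groups of programmable processors recited in claims 2 and 3 may be modeled as compositions of two-input operations. The following Python sketch is an assumption-laden software analogy and not the claimed hardware: the names `first_group`, `second_group` and `third_group`, the use of ordinary arithmetic operators as stand-ins for programmable processors, and the example connection are all hypothetical.

```python
import operator
from typing import Callable

# A two-input, one-output operation standing in for one programmable processor.
BinaryOp = Callable[[float, float], float]


def first_group(op1: BinaryOp, op2: BinaryOp, op3: BinaryOp):
    """Three processors, four inputs, one output."""
    def run(a: float, b: float, c: float, d: float) -> float:
        # The third processor is fed by the outputs of the first and second.
        return op3(op1(a, b), op2(c, d))
    return run


def second_group(op4: BinaryOp, op5: BinaryOp):
    """Two processors, three inputs, one output."""
    def run(a: float, b: float, c: float) -> float:
        # One input of the fifth processor is fed by the fourth processor.
        return op5(op4(a, b), c)
    return run


def third_group(op: BinaryOp) -> BinaryOp:
    """An individual processor with two inputs and one output."""
    return op


# Hypothetical connection: the output of a first group feeds a second group,
# computing ((a * b) + (c * d)) * e - f.
g1 = first_group(operator.mul, operator.mul, operator.add)
g2 = second_group(operator.mul, operator.sub)
result = g2(g1(2.0, 3.0, 4.0, 5.0), 0.5, 1.0)
print(result)
```

In this analogy, choosing which group output feeds which group input plays the role of the switching means, and changing the operators or the connection from one call to the next loosely corresponds to the sequencer supplying new instructions.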

***** 



